fix: add function DropItemInOneWriteQueue to do the accurate queue clear when slave and master disconnect due to timeout #2666

cheniujh · 2024-05-21T12:49:57Z

This PR fixes Issue #2665

Causes of Issue #2665:

1 When the master and a slave disconnect because of a timeout, the master should clear only the Binlog WriteQueue of the DB that timed out. Instead, it clears every WriteQueue associated with the timed-out slave node, including queues that connected DBs are still using to transmit Binlog. Example: DB0 times out while DB1's master-slave connection is still healthy and performing incremental synchronization. The timeout handling for DB0 also clears DB1's WriteQueue, when it should clear only the WriteQueue that belongs to DB0.
2 Similarly, when a slave has just established an incremental synchronization connection with the master, it sends a special "first-binlog-ack" that tells the master where to resume Binlog transmission. The master should reset/clear only the WriteQueue of the DB that sent the "first-binlog-ack", but it actually resets/clears every WriteQueue of the slave node that sent it. Example: DB0 establishes its incremental synchronization connection first, and DB1 follows shortly afterwards. The "first-binlog-ack" from DB1 also clears DB0's WriteQueue, which by then very likely contains pending items.
3 Because of points 1 and 2, the master unintentionally drops binlog items from unrelated WriteQueues, and the dropped items are exactly the binlogs that never get sent. This is why the master receives a BinlogAck whose AckEnd is smaller than its AckStart: the slave's latest binlog offset is far behind what the master expects.
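The loss described in points 1 and 3 can be reproduced with a small toy model. This is only a sketch under stated assumptions, not code from src/pika_rm.cc: the `DbQueues` alias, `ClearAllQueues`, `Flush`, and `ReplayTimeoutScenario` are hypothetical names, and binlog items are reduced to plain integer offsets.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <map>
#include <string>
#include <vector>

// Hypothetical model: one slave's pending binlog offsets, keyed by DB name.
using DbQueues = std::map<std::string, std::deque<uint64_t>>;

// Models the faulty slave-wide clear: every DB's queue is emptied,
// not just the queue of the DB that timed out.
inline void ClearAllQueues(DbQueues& queues) {
  for (auto& entry : queues) {
    entry.second.clear();
  }
}

// Drains one DB's queue and returns the offsets that would be sent.
inline std::vector<uint64_t> Flush(DbQueues& queues, const std::string& db) {
  std::deque<uint64_t>& q = queues[db];
  std::vector<uint64_t> sent(q.begin(), q.end());
  q.clear();
  return sent;
}

// Replays the DB0-timeout scenario and returns what is sent for db1.
inline std::vector<uint64_t> ReplayTimeoutScenario() {
  DbQueues queues;
  for (uint64_t off = 1; off <= 3; ++off) {
    queues["db1"].push_back(off);  // db1 has offsets 1..3 waiting to be sent
  }
  queues["db0"].push_back(1);      // db0 also has a pending item
  ClearAllQueues(queues);          // db0 times out, db1's items are dropped too
  queues["db1"].push_back(4);      // the master keeps producing for db1
  return Flush(queues, "db1");     // offsets 1..3 are never sent
}
```

In this model the slave side of db1 would receive offset 4 without ever seeing offsets 1 to 3, which mirrors the gap between what the slave acknowledges and what the master expects.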

How this PR fixes the Issue:

The PR adds a function, "DropItemInOneWriteQueue", and uses it in place of "DropItemInWriteQueue" in the scenarios described above, so that the master no longer clears unrelated WriteQueues.
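The difference between the two functions can be sketched as follows. Only the names `DropItemInWriteQueue` and `DropItemInOneWriteQueue` come from this PR; the `WriteQueueRegistry` class, the `WriteTask` struct, the string slave key, the parameter lists, and the nested-map layout are assumptions made for illustration and may not match the actual code in src/pika_rm.cc.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <queue>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a queued binlog item.
struct WriteTask {
  uint64_t offset;
};

// Hypothetical registry: slave key -> DB name -> pending write tasks.
class WriteQueueRegistry {
 public:
  void Produce(const std::string& slave, const std::string& db, uint64_t offset) {
    queues_[slave][db].push(WriteTask{offset});
  }

  // Slave-wide clear: drops the queues of every DB on this slave.
  void DropItemInWriteQueue(const std::string& slave) {
    queues_.erase(slave);
  }

  // Scoped clear: drops only the queue of one DB on this slave.
  void DropItemInOneWriteQueue(const std::string& slave, const std::string& db) {
    auto it = queues_.find(slave);
    if (it != queues_.end()) {
      it->second.erase(db);
    }
  }

  // Number of tasks still queued for (slave, db).
  std::size_t Pending(const std::string& slave, const std::string& db) const {
    auto s = queues_.find(slave);
    if (s == queues_.end()) {
      return 0;
    }
    auto d = s->second.find(db);
    return d == s->second.end() ? 0 : d->second.size();
  }

 private:
  std::unordered_map<std::string,
                     std::unordered_map<std::string, std::queue<WriteTask>>>
      queues_;
};
```

With this layout, calling `DropItemInOneWriteQueue("slave-a", "db0")` after a db0 timeout leaves the tasks queued for db1 on the same slave untouched, whereas `DropItemInWriteQueue("slave-a")` would discard them as well.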

src/pika_rm.cc

fix the problem that BinlogAckEnd smaller than BinlogAckStart(due to …

82a84f1

…the Master clean un-relevant WriteQueue when one DB timeout)

github-actions bot added the ☢️ Bug and ✏️ Feature labels May 21, 2024

cheniujh requested review from wangshao1, AlexStocks, chejinge and baixin01 May 21, 2024 13:25

AlexStocks reviewed May 22, 2024

src/pika_rm.cc (review thread resolved)

cheniujh mentioned this pull request May 22, 2024

fix: make SlaveDB stay in WaitDBSync state instead of sink into Error State if rsync init failed #2667

Merged

cheniujh added the 4.0.0 and 3.5.5 labels and removed the ✏️ Feature label May 22, 2024

baixin01 approved these changes May 23, 2024

wangshao1 approved these changes May 23, 2024

chejinge approved these changes May 23, 2024

AlexStocks merged commit 6cd3e64 into OpenAtomFoundation:unstable May 24, 2024
25 checks passed

cheniujh deleted the AckEnd_smaller_than_AckStart branch June 24, 2024 03:21

cheniujh restored the AckEnd_smaller_than_AckStart branch June 25, 2024 11:59

chejinge pushed a commit that referenced this pull request Jul 31, 2024

fix the problem that BinlogAckEnd smaller than BinlogAckStart(due to …

9d443d2

…the Master clean un-relevant WriteQueue when one DB timeout) (#2666) Co-authored-by: cjh <[email protected]>

AlexStocks mentioned this pull request Aug 15, 2024

docs:add 355 changelog #2867

Merged
